K-Best Suffix Arrays

Authors

  • Kenneth Ward Church
  • Bo Thiesson
  • Robert Ragno
Abstract

Suppose we have a large dictionary of strings. Each entry starts with a figure of merit (popularity). We wish to find the k-best matches for a substring, s, in a dictionary, dict. That is, grep s dict | sort -n | head -k, but we would like to do this in sublinear time. Example applications: (1) web queries with popularities, (2) products with prices and (3) ads with click-through rates. This paper proposes a novel index, k-best suffix arrays, based on ideas borrowed from suffix arrays and k-d trees. A standard suffix array sorts the suffixes by a single order (lexicographic), whereas k-best suffix arrays are sorted by two orders (lexicographic and popularity). Lookup time is between log N and sqrt N.

1 Standard Suffix Arrays

This paper will introduce k-best suffix arrays, which are similar to standard suffix arrays (Manber and Myers, 1990), an index that makes it convenient to compute the frequency and location of a substring, s, in a long sequence, corpus. A suffix array, suf, is an array of all N suffixes, sorted alphabetically. A suffix, suf[i], also known as a semi-infinite string, is a string that starts at position j in the corpus and continues to the end of the corpus. In practical implementations, a suffix is a 4-byte integer, j. In this way, an int (constant space) denotes a long string (N bytes).

The make_standard_suf program below creates a standard suffix array. The program starts with a corpus, a global variable containing a long string of N characters. The program allocates the suffix array suf and initializes it to a vector of N ints (suffixes) ranging from 0 to N−1. The suffix array is sorted by lexicographic order and returned.

    #include <stdlib.h>
    #include <string.h>

    /* qsort comparator: a and b point to ints, each an offset into corpus */
    int lexcomp(const void* a, const void* b) {
      return strcmp(corpus + *(const int*)a, corpus + *(const int*)b);
    }

    int* make_standard_suf () {
      int N = (int)strlen(corpus);
      int* suf = (int*)malloc(N * sizeof(int));
      for (int i=0; i<N; i++) suf[i] = i;   /* suffix i starts at offset i */
      qsort(suf, N, sizeof(int), lexcomp);  /* sort suffixes lexicographically */
      return suf;
    }

This program is simple to describe but inefficient, at least in theory, because strcmp can take O(N) time in the worst case (for example, when the corpus contains two copies of an arbitrarily long string). See http://cm.bell-labs.com/cm/cs/who/doug/ssort.c for an implementation of the O(N log N) Manber and Myers algorithm. In practice, however, when the corpus is a dictionary of relatively short entries (such as web queries), the worst case is unlikely to come up. In that case, the simple make_standard_suf program above is good enough, and may even be better than the O(N log N) solution.

1.1 Standard Suffix Array Lookup

To compute the frequency and locations of a substring s, use a pair of binary searches to find i and j, the locations of the first and last suffixes in the suffix array that start with s. Each suffix between i and j points to a location of s in the corpus. The frequency is simply j − i + 1. Here is some simple code. We show how to find the first suffix. The last suffix is left as an exercise. As above, we ignore the unlikely worst
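
The paper's own lookup code is not included in this excerpt. As a rough illustration of the pair of binary searches described in Section 1.1, here is a minimal sketch in C. The helper names bound and find_range are hypothetical, not taken from the paper. The sketch assumes the corpus is a global NUL-terminated string and, like the text, ignores the worst-case cost of the string comparisons.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: the corpus is a global NUL-terminated string, as in the text. */
const char *corpus;

/* qsort comparator: order two suffixes, each stored as an int offset. */
static int lexcomp(const void *a, const void *b) {
    return strcmp(corpus + *(const int *)a, corpus + *(const int *)b);
}

/* Build the standard suffix array for corpus; the caller frees it. */
int *make_standard_suf(void) {
    int N = (int)strlen(corpus);
    int *suf = malloc((size_t)N * sizeof *suf);
    assert(suf != NULL || N == 0);
    for (int i = 0; i < N; i++) suf[i] = i;
    qsort(suf, (size_t)N, sizeof *suf, lexcomp);
    return suf;
}

/* Hypothetical helper: smallest index in [0, N] whose suffix compares
 * >= s (strict == 0) or > s (strict != 0), looking only at the first
 * strlen(s) characters of each suffix. */
static int bound(const int *suf, int N, const char *s, int strict) {
    size_t len = strlen(s);
    int lo = 0, hi = N;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strncmp(corpus + suf[mid], s, len);
        if (c < 0 || (strict && c == 0))
            lo = mid + 1;   /* answer lies to the right of mid */
        else
            hi = mid;       /* mid itself may be the answer */
    }
    return lo;
}

/* Hypothetical helper: stores the indices of the first and last suffixes
 * that start with s and returns the frequency, last - first + 1.
 * When s does not occur, last == first - 1 and the result is 0. */
int find_range(const int *suf, int N, const char *s, int *first, int *last) {
    assert(s != NULL && first != NULL && last != NULL);
    *first = bound(suf, N, s, 0);
    *last = bound(suf, N, s, 1) - 1;
    return *last - *first + 1;
}
```

For example, with corpus set to "banana", the suffix array is 5 3 1 0 4 2, and find_range for "ana" returns 2 with first = 1 and last = 2, which point to corpus positions 3 and 1.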

Similar resources

Computing Longest Common Substrings Via Suffix Arrays

Given a set of N strings A = {α1, . . . , αN} of total length n over alphabet Σ one may ask to find, for a fixed integer K, 2 ≤ K ≤ N , the longest substring β that appears in at least K strings in A. It is known that this problem can be solved in O(n) time with the help of suffix trees. However, the resulting algorithm is rather complicated. Also, its running time and memory consumption may de...

Suffix Arrays on Words

Surprisingly enough, it is not yet known how to build directly a suffix array that indexes just the k positions at word-boundaries of a text T [1, n], taking O(n) time and O(k) space in addition to T . We propose a class-note solution to this problem that achieves such optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffi...

Suffix arrays: what are they good for?

Recently the theoretical community has displayed a flurry of interest in suffix arrays, and compressed suffix arrays. New, asymptotically optimal algorithms for construction, search, and compression of suffix arrays have been proposed. In this talk we will present our investigations into the practicalities of these latest developments. In particular, we investigate whether suffix arrays can ind...

On the combinatorics of suffix arrays

We prove several combinatorial properties of suffix arrays, including a characterization of suffix arrays through a bijection with a certain well-defined class of permutations. Our approach is based on the characterization of Burrows-Wheeler arrays given in [1], that we apply by reducing suffix sorting to cyclic shift sorting through the use of an additional sentinel symbol. We show that the ch...

Computing suffix links for suffix trees and arrays

We present a new and simple algorithm to reconstruct suffix links in suffix trees and suffix arrays. The algorithm is based on observations regarding suffix tree construction algorithms. With our algorithm we bring suffix arrays even closer to the ease of use and implementation of suffix trees.

Publication date: 2007